## Uncomment and run this cell to install pandas and numpy
#!pip install pandas numpy
Cleaning the weather dataset
In this notebook, we’ll be using numpy and pandas.
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool.
Numpy is the fundamental package for scientific computing with Python.
Let’s install the packages pandas and numpy.
# import the libraries
import pandas as pd
import numpy as np
from dataidea.datasets import loadDataset
Let’s check the versions of python, numpy and pandas we’ll be using for this notebook
# checking python version
print('Python Version: ',)
!python --version
Python Version:
Python 3.10.12
# Checking numpy and pandas versions
print('Pandas Version: ', pd.__version__)
print('Numpy Version: ', np.__version__)
Pandas Version: 2.2.1
Numpy Version: 1.26.4
Let’s load the dataset. We’ll be using a weather dataset that imagined for learning purposes.
# load the dataset
= loadDataset('weather') weather_data
We can sample out random rows from the dataset using the sample()
method, we can use the n
parameter to specify the number of rows to sample
# sample out random values from the dataset
=5) weather_data.sample(n
day | temperature | windspead | event | |
---|---|---|---|---|
2 | 05/01/2017 | 28.0 | NaN | Snow |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
4 | 07/01/2017 | 32.0 | NaN | Rain |
Display some info about the dataset eg number of entries, count of non-null values and variable datatypes using the info()
method
# get quick dataframe info
weather_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 day 9 non-null object
1 temperature 5 non-null float64
2 windspead 5 non-null float64
3 event 7 non-null object
dtypes: float64(2), object(2)
memory usage: 416.0+ bytes
We can count all missing values in each column in our dataframe by using dataframe.isna().sum()
, eg
# count missing values in each column
sum() weather_data.isna().
day 0
temperature 4
windspead 4
event 2
dtype: int64
We can use a boolean-indexing like technique to find all rows in a dataset with missing values in a specific column.
# get rows with missing data in temperature
weather_data[weather_data.temperature.isna()]
day | temperature | windspead | event | |
---|---|---|---|---|
1 | 04/01/2017 | NaN | 9.0 | Sunny |
3 | 06/01/2017 | NaN | 7.0 | NaN |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
# get rows with missing data in event column
weather_data[weather_data.event.isna()]
day | temperature | windspead | event | |
---|---|---|---|---|
3 | 06/01/2017 | NaN | 7.0 | NaN |
6 | 09/01/2017 | NaN | NaN | NaN |
For the next part, we would like to demonstrate forward fill (ffill()
) and backward fill (bfill
), we first create two copies of the dataframe to avoid modifying our original copy in memory. - ffill()
fills the missing values with the previous valid value in the column - bfill()
fills the missing values with the next valid value in the column
# Create copies of a dataframe
= weather_data.copy()
weather_data1 = weather_data.copy() weather_data2
# fill with the previous valid value
'event'] = weather_data1.event.ffill()
weather_data1[ weather_data1
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | Snow |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | Sunny |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
weather_data
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | NaN |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | NaN |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
# fill with the next valid value in the column
'event'] = weather_data2.event.bfill()
weather_data2[ weather_data2
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | NaN | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | Rain |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | Cloudy |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can modify (or fill) a specific value in the dataframe by using the loc[]
method. This picks the value by its row (index) and column names. Assigning it a new value modifies it in the dataframe as illustrated below
# modify a specific value in the dataframe
1, 'temperature'] = 29
weather_data2.loc[ weather_data2
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | NaN | 7.0 | Rain |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | NaN | NaN | Sunny |
6 | 09/01/2017 | NaN | NaN | Cloudy |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can use the fillna()
method to replace all missing values in a column with a specific value as demostrated value
# replace missing values in temperature column with mean
'temperature'] = weather_data2.temperature.fillna(
weather_data2[=weather_data2.temperature.mean()
value
) weather_data2
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | NaN | Snow |
3 | 06/01/2017 | 32.5 | 7.0 | Rain |
4 | 07/01/2017 | 32.0 | NaN | Rain |
5 | 08/01/2017 | 32.5 | NaN | Sunny |
6 | 09/01/2017 | 32.5 | NaN | Cloudy |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
# create a copy of weather_data2
= weather_data2.copy() weather_data22
# Replace missing values in windspead column with a specific value
'windspead'] = weather_data2.windspead.fillna(value=7.5)
weather_data22[ weather_data22
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 29.0 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | 7.5 | Snow |
3 | 06/01/2017 | 32.5 | 7.0 | Rain |
4 | 07/01/2017 | 32.0 | 7.5 | Rain |
5 | 08/01/2017 | 32.5 | 7.5 | Sunny |
6 | 09/01/2017 | 32.5 | 7.5 | Cloudy |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can also use the fillna()
method to fill missing values in multiple columns by passing in the dictionary of key/value pairs of column-name and value to replace. Before we demonstrate this, let’s create a copy of the dataframe to avoid modifying the original in memory
# create a copy of the weather_data dataframe
= weather_data.copy() weather_data3
# Replace missing values in temperature, column and event
={
weather_data3.fillna(value'temperature': weather_data3.temperature.mean(),
'windspead': weather_data3.windspead.max(),
'event': weather_data3.event.bfill()
=True) }, inplace
weather_data3
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
1 | 04/01/2017 | 33.2 | 9.0 | Sunny |
2 | 05/01/2017 | 28.0 | 12.0 | Snow |
3 | 06/01/2017 | 33.2 | 7.0 | Rain |
4 | 07/01/2017 | 32.0 | 12.0 | Rain |
5 | 08/01/2017 | 33.2 | 12.0 | Sunny |
6 | 09/01/2017 | 33.2 | 12.0 | Cloudy |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |
We can optionally drop all rows with missing values using the dropna()
method. Before we demonstrate that, let’s first create a dataframe copy to avoid modifying the original in the memory
# create a copy of weather_data dataframe
= weather_data.copy() weather_data4
# Drop all rows with missing values
weather_data4.dropna()
day | temperature | windspead | event | |
---|---|---|---|---|
0 | 01/01/2017 | 32.0 | 6.0 | Rain |
7 | 10/01/2017 | 34.0 | 8.0 | Cloudy |
8 | 11/01/2017 | 40.0 | 12.0 | Sunny |